TECHNICAL FIELD

The present invention relates to a sound acquisition method and a
sound acquisition apparatus and, more particularly, to a sound acquisition
method and a sound acquisition apparatus that acquire speech sounds from a
plurality of speech sound sources and adjust their volume before outputting.
PRIOR ART

For example, in a teleconference in which persons at different remote
locations participate, if a single microphone is used at each location to
acquire speech sounds of plural participants sitting at different positions,
received signal levels greatly differ because of different distances from
the participants to the microphone and different volumes of their speech
sounds. At the remote receiving side, the reproduced speech sounds greatly
differ in volume from one transmitting-side participant to another and, in
some cases, it is hard to distinguish one participant from another.

Fig. 17 illustrates in block form the basic configuration of a
conventional sound acquisition apparatus disclosed, for example, in Japanese
Patent Application Kokai Publication 8-250944. The conventional sound
acquisition apparatus is made up of a microphone 41, a power calculating part
42, an amplification factor setting part 43, and an amplifier 44. The power
calculating part 42 calculates a long-time mean power Pave of the signal
received by the microphone 41. The long-time mean power can be obtained
by squaring the signal and time-integrating the squared output. Next, the
amplification factor setting part 43 sets an amplification factor G based on
the long-time mean power Pave of the received signal calculated by the power
calculating part 42 and a preset desired sending level Popt. The
amplification factor G can be calculated, for example, by the following
equation (1).

    G = (P_{opt} / P_{ave})^{1/2}    (1)

The amplifier 44 amplifies the microphone received signal by the set 
amplification factor G and outputs the amplified signal. 
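
The conventional gain control described above can be sketched as follows.
This is only a minimal illustration, assuming that the exponent in Eq. (1)
is 1/2 (an amplitude gain derived from a power ratio) and that the
time-integration is replaced by a plain average over the available samples.
The function names and the sample values are hypothetical.

```python
import math

def long_time_mean_power(samples):
    """Mean of the squared samples (discrete stand-in for square-and-integrate)."""
    return sum(s * s for s in samples) / len(samples)

def amplification_factor(p_opt, p_ave):
    """Assumed form of Eq. (1): G = (Popt / Pave)^(1/2)."""
    return math.sqrt(p_opt / p_ave)

# Hypothetical quiet talker picked up by the microphone.
signal = [0.1, -0.2, 0.15, -0.05] * 100
p_ave = long_time_mean_power(signal)
g = amplification_factor(p_opt=0.04, p_ave=p_ave)

# The amplifier scales every sample by G, which brings the mean power to Popt.
amplified = [g * s for s in signal]
```

Because Pave is averaged over a long window, G reacts slowly when the
talker changes, which is the drawback discussed next.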

By the processing described above, the output signal power is brought to
the desired sending level Popt, whereby the volume is automatically adjusted.

With the conventional sound acquisition method, however, since the
amplification factor is determined based on the long-time mean power, a
delay of several seconds to tens of seconds develops in the setting of an
appropriate amplification factor. Accordingly, in the case where plural
speakers are present and their speech sounds are acquired by the microphone
at different levels, there arises a problem that whenever the speaker
changes from one to another, setting of an appropriate amplification factor
is delayed, resulting in the speech sound being reproduced at an
inappropriate volume.

An object of the present invention is to provide a sound acquisition
apparatus and a sound acquisition method that, even where plural speakers
are present and their speech sounds are picked up by a microphone at
different levels, automatically adjust the volume of each speech sound to an
appropriate value, and a program for implementing the method.

DISCLOSURE OF THE INVENTION

A sound acquisition method for acquiring sound from each sound
source by microphones of plural channels according to the present invention
comprises:

(a) a state deciding step including an utterance deciding step of
deciding an utterance period from signals received by said plural-channel
microphones;

(b) a sound source position detecting step of detecting the position of
said each sound source from said received signals when the utterance period
is decided in said utterance deciding step;

(c) a frequency domain converting step of converting said received
signals to frequency domain signals;

(d) a covariance matrix calculating step of calculating a covariance
matrix of said frequency domain received signals;

(e) a covariance matrix storage step of storing said covariance matrix
for each sound source based on the result of detection in said sound source
position detecting step;

(f) a filter coefficient calculating step of calculating filter coefficients
of said plural channels based on said stored covariance matrix and a
predetermined output level;

(g) a filtering step of filtering the received signals of said plural
channels by filter coefficients of said plural channels, respectively; and

(h) an adding step of adding together the results of filtering in said
plural channels, and providing the added output as a send signal.

According to the present invention, a sound acquisition apparatus
which acquires sound from each sound source by microphones of plural
channels placed in an acoustic space comprises:

a state decision part including an utterance deciding part for deciding
an utterance period from signals received by said plural-channel microphones;

a sound source position detecting part for detecting the position of said
each sound source from said received signals when the utterance period is
decided by said utterance deciding part;

a frequency domain converting part for converting said received
signals to frequency domain signals;

a covariance matrix calculating part for calculating a covariance
matrix of said frequency domain received signals of said plural channels;

a covariance matrix storage part for storing said covariance matrix for
said each sound source based on the result of detection by said sound source
position detecting part;

a filter coefficient calculating part for calculating filter coefficients of
said plural channels by use of said stored covariance matrix so that the send
signal level for said each sound source becomes a desired level;

filters of said plural channels for filtering the received signals from
said microphones by use of the filter coefficients of said plural channels,
respectively; and

an adder for adding together the outputs from said filters of said plural
channels and for providing the added output as a send signal.

According to a second aspect of the present invention, a sound
acquisition method for acquiring speech sound from at least one sound source
by a microphone of at least one channel in an acoustic space in which a
received signal is reproduced by a loudspeaker comprises:

(a) a state deciding step of deciding an utterance period and a
receiving period from the sound acquired by said microphone of said at least
one channel and said received signal;

(b) a frequency domain converting step of converting said acquired
signal and said received signal to frequency domain signals;

(c) a covariance matrix calculating step of calculating a covariance
matrix in said utterance period and a covariance matrix in said receiving
period from said frequency domain acquired signal and received signal;

(d) a covariance matrix storage step of storing said covariance
matrices for said utterance period and for said receiving period, respectively;

(e) a filter coefficient calculating step of calculating filter coefficients
for said acquired signal of said at least one channel and filter coefficients for
said received signal based on said stored covariance matrices in said utterance
period and said receiving period so that an acoustic echo, which is a received
signal component contained in said acquired signal, is cancelled;

(f) a filtering step of filtering said received signal and said acquired
signal by use of said filter coefficients for said received signal and filter
coefficients for said acquired signal of said at least one channel; and

(g) an adding step of adding together said filtered signals and
providing the added output as a send signal.
A sound acquisition apparatus according to the second aspect of the
present invention comprises:

a microphone of at least one channel for acquiring speech sound from
a sound source and for outputting an acquired signal;

a loudspeaker for reproducing a received signal;

a state decision part for deciding an utterance period and a receiving
period from said acquired signal and said received signal;

a frequency domain converting part for converting said acquired
signal and said received signal to frequency domain signals;

a covariance matrix calculating part for calculating covariance
matrices of said acquired and received signals of said frequency domain for
said utterance period and for said receiving period, respectively;

a covariance matrix storage part for storing said covariance matrices
for said utterance period and for said receiving period, respectively;

a filter coefficient calculating part for calculating filter coefficients for
said acquired signal of said at least one channel and filter coefficients for said
received signal based on said stored covariance matrices so that an acoustic
echo of said received signal is cancelled;

an acquired signal filter and a received signal filter having set therein
said filter coefficients for said acquired signal and said filter coefficients for
said received signal, for filtering said acquired signal and for filtering said
received signal; and

an adder for adding together the outputs from said acquired signal
filter and said received signal filter, and for providing the added output as a
send signal.

According to the present invention, even when plural speakers are
present and their speech sounds are acquired by a plurality of microphones at
different levels, the directivity of the microphones is appropriately controlled
to automatically adjust the volume of the speech sound to an appropriate
value for each speaker.



BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a block diagram illustrating a sound acquisition apparatus
according to a first embodiment of the present invention.

Fig. 2 is a block diagram showing an example of the configuration of a
state decision part 14 in Fig. 1.

Fig. 3 is a block diagram showing an example of the configuration of a
sound source position detecting part 15 in Fig. 1.

Fig. 4 is a block diagram showing an example of the configuration of a
filter coefficient calculating part 21 in Fig. 1.

Fig. 5 is a flowchart showing a first example of a sound acquisition
method using the sound acquisition apparatus of Fig. 1.

Fig. 6 is a flowchart showing a second example of a sound acquisition
method using the sound acquisition apparatus of Fig. 1.

Fig. 7 is a flowchart showing a third example of a sound acquisition
method using the sound acquisition apparatus of Fig. 1.

Fig. 8 is a block diagram illustrating a sound acquisition apparatus
according to a second embodiment of the present invention.

Fig. 9 is a block diagram showing an example of the configuration of
the state decision part 14 in Fig. 8.

Fig. 10 is a block diagram illustrating a sound acquisition apparatus
according to a third embodiment of the present invention.

Fig. 11 is a block diagram showing an example of the configuration of
the state decision part 14 in Fig. 10.
Fig. 12 is a block diagram illustrating a sound acquisition apparatus
according to a fourth embodiment of the present invention.

Fig. 13 is a block diagram illustrating a sound acquisition apparatus
according to a fifth embodiment of the present invention.

Fig. 14 is a block diagram showing an example of the configuration of
a weighting factor setting part 21H in Fig. 4.

Fig. 15 is a block diagram showing another example of the
configuration of a weighting factor setting part 21H in Fig. 4.

Fig. 16 is a block diagram showing an example of the configuration of
a whitening part 21J in Fig. 4.

Fig. 17 is a block diagram showing an example of a covariance matrix
storage part 18 that is used when each embodiment is equipped with a
covariance matrix averaging function.

Fig. 18A is a diagram showing simulated speech waveforms of
speakers A and B before processing by the first embodiment.

Fig. 18B is a diagram showing simulated speech waveforms of
speakers A and B after processing by the first embodiment.

Fig. 19 is a diagram showing received and send speech waveforms
obtained by simulation, which show acoustic echo and noise cancellation
according to a third embodiment.

Fig. 20 is a block diagram illustrating a conventional sound
acquisition apparatus.

BEST MODE FOR CARRYING OUT THE INVENTION
FIRST EMBODIMENT

Fig. 1 is a block diagram of a sound acquisition apparatus according to
a first embodiment of the present invention.

The sound acquisition apparatus of this embodiment comprises
microphones 11_1 to 11_M of M channels disposed in an acoustic space, filters
12_1 to 12_M, an adder 13, a state decision part 14, a sound source position
detecting part 15, a frequency domain converting part 16, a covariance matrix
calculating part 17, a covariance matrix storage part 18, an acquired sound
level estimating part 19, and a filter coefficient calculating part 21.

In this embodiment, the positions of speech sound sources 9_1 to 9_K in
an acoustic space are detected, then covariance matrices of acquired signals
in the frequency domain for the respective speech sound sources are
calculated and stored, and these covariance matrices are used to calculate
filter coefficients. These filter coefficients are used to filter the signals
acquired by the microphones, thereby controlling the signals from the
respective speech sound sources to have a constant volume. In this
embodiment, let it be assumed that the output signals from the microphones
11_1 to 11_M are digital signals into which the signals acquired by the
microphones are converted by an analog-to-digital converter at a
predetermined sampling frequency, though not shown in particular. This
applies to the other embodiments of the present invention.

In the first place, the state decision part 14 detects an utterance period
from the signals received by the microphones 11_1 to 11_M. For example, as
shown in Fig. 2, in the state decision part 14 all the received signals from
the microphones 11_1 to 11_M are added together by an adding part 14A, then
the added output is applied to a short-time mean power calculating part 14B
and a long-time mean power calculating part 14C to obtain a short-time mean
power PavS (approximately in the range of 0.1 to 1 s, for instance) and a
long-time mean power PavL (approximately in the range of 1 to 100 s, for
instance), respectively. The ratio between the short-time mean power and
the long-time mean power, Rp = PavS/PavL, is then calculated in a division
part 14D, and in an utterance decision part 14E the power ratio Rp is
compared with a predetermined utterance threshold value RthU; if the power
ratio exceeds the threshold value, the period concerned is decided to be an
utterance period.
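
The decision made by parts 14A to 14E can be sketched as below. This is a
minimal illustration, not the apparatus itself: the window lengths are taken
from the ranges quoted above, while the threshold value 2.0, the function
names and the use of plain block averages are assumptions made for the
example.

```python
def mean_power(samples):
    """Average of the squared samples over the given window."""
    return sum(s * s for s in samples) / len(samples)

def is_utterance(channels, fs, short_s=0.1, long_s=1.0, r_thu=2.0):
    """Decide utterance for the most recent instant of M equal-length channels.

    channels: list of M sample lists; fs: sampling frequency in Hz.
    r_thu is a hypothetical value for the threshold RthU.
    """
    summed = [sum(frame) for frame in zip(*channels)]   # adding part 14A
    p_avs = mean_power(summed[-int(short_s * fs):])     # short-time power, 14B
    p_avl = mean_power(summed[-int(long_s * fs):])      # long-time power, 14C
    if p_avl == 0.0:
        return False                                    # silence: no utterance
    r_p = p_avs / p_avl                                 # division part 14D
    return r_p > r_thu                                  # decision part 14E

# Two channels: 0.9 s of low-level background followed by 0.1 s of speech.
fs = 1000
channel = [0.01] * 900 + [1.0] * 100
speaking = is_utterance([channel, channel], fs)
```

A stationary background gives Rp close to 1, so only a sudden rise in
short-time power relative to the long-time average is flagged.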

When the decision result by the state decision part 14 is the utterance
period, the sound source position detecting part 15 estimates the position of
the sound source. A method for estimating the sound source position is, for
example, a cross-correlation method.

Let M (M being an integer equal to or greater than 2) represent the
number of microphones and \tau_{ij} represent a measured value of the delay
time difference between signals acquired by the i-th and j-th microphones
11_i and 11_j. The measured value of the delay time difference between the
acquired signals can be obtained by calculating the cross-correlation between
the acquired signals and detecting its maximum peak position. Next, let the
sound acquisition position of an m-th (where m = 1, ..., M) microphone be
represented by (x_m, y_m, z_m) and an estimated sound source position by
(X, Y, Z). An estimated value \hat{\tau}_{ij} of the delay time difference
between the acquired signals, which is available from these positions, is
expressed by Eq. (2).

    \hat{\tau}_{ij} = \frac{1}{c}\sqrt{(x_i - X)^2 + (y_i - Y)^2 + (z_i - Z)^2}
                    - \frac{1}{c}\sqrt{(x_j - X)^2 + (y_j - Y)^2 + (z_j - Z)^2}    (2)

where c is sound velocity.
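
Measuring the delay time difference by the peak of the cross-correlation
can be sketched as follows. This is a brute-force illustration under
assumed conventions: the lag is returned in samples (divide by the sampling
frequency to obtain seconds), it is positive when the signal reaches
microphone i later than microphone j, and the function name and search
range max_lag are hypothetical.

```python
def delay_by_cross_correlation(x_i, x_j, max_lag):
    """Return the lag (in samples) that maximizes sum_t x_i[t + lag] * x_j[t]."""
    n = min(len(x_i), len(x_j))
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = 0.0
        for t in range(n):
            u = t + lag
            if 0 <= u < n:                 # ignore samples outside the frame
                val += x_i[u] * x_j[t]
        if val > best_val:                 # keep the maximum peak position
            best_lag, best_val = lag, val
    return best_lag

# Hypothetical test signal: the same short pulse, 3 samples later at mic i.
x_j = [0.0] * 50
x_j[10:13] = [1.0, -0.5, 0.25]
x_i = [0.0] * 3 + x_j[:-3]
lag_ij = delay_by_cross_correlation(x_i, x_j, max_lag=8)
```

In practice the search range would be bounded by the largest microphone
spacing divided by the sound velocity.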

Next, the measured and estimated values \tau_{ij} and \hat{\tau}_{ij} of the
delay time difference between the acquired signals are multiplied by the
sound velocity c for conversion into distance values, which are used as
measured and estimated values d_{ij} and \hat{d}_{ij} of the difference in
the distance to the uttering sound source between the positions of the
microphones acquiring the speech sound therefrom, respectively; a mean
square error e(q) of these values is given by Eq. (3).

    e(q) = \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} | d_{ij} - \hat{d}_{ij} |^2
         = \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} | d_{ij} - (r_i - r_j) |^2    (3)

where q = (X, Y, Z), and r_i and r_j represent the distances between the
estimated sound source position q = (X, Y, Z) and the microphones 11_i and
11_j.

By obtaining a solution that minimizes the mean square error e(q) of
Eq. (3), it is possible to obtain the estimated sound source position that
minimizes the error between the measured and estimated values of the delay
time difference between the acquired signals. In this instance, however,
since minimizing Eq. (3) leads to nonlinear simultaneous equations that are
difficult to solve analytically, the estimated sound source position is
obtained by a numerical analysis using successive correction.

To obtain the estimated sound source position (X, Y, Z) that
minimizes Eq. (3), the gradient at a certain point in Eq. (3) is calculated, then
the estimated sound source position is corrected in the direction in which to
reduce the error until the gradient becomes zero; accordingly, the estimated
sound source position is corrected by repeatedly calculating the following
equation (4) for u = 0, 1, ...

    q^{(u+1)} = q^{(u)} - \alpha \cdot \mathrm{grad}\, e(q) |_{q = q^{(u)}}    (4)

where \alpha is a step size of correction, and it is set at a value \alpha > 0.
q^{(u)} represents q corrected u times, and q^{(0)} = (X_0, Y_0, Z_0) is a
predetermined arbitrary initial value when u = 0. grad represents the
gradient, which is expressed by the following equations (5) to (10).



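    \mathrm{grad}\, e(q) = \left( \frac{\partial e(q)}{\partial X},
                                  \frac{\partial e(q)}{\partial Y},
                                  \frac{\partial e(q)}{\partial Z} \right)    (5)

    \frac{\partial e(q)}{\partial X} = 2 \sum_{i=1}^{M-1} \sum_{j=i+1}^{M}
        \{ d_{ij} - (r_i - r_j) \}
        \left\{ \frac{x_i - X}{r_i} - \frac{x_j - X}{r_j} \right\}    (6)

    \frac{\partial e(q)}{\partial Y} = 2 \sum_{i=1}^{M-1} \sum_{j=i+1}^{M}
        \{ d_{ij} - (r_i - r_j) \}
        \left\{ \frac{y_i - Y}{r_i} - \frac{y_j - Y}{r_j} \right\}    (7)

    \frac{\partial e(q)}{\partial Z} = 2 \sum_{i=1}^{M-1} \sum_{j=i+1}^{M}
        \{ d_{ij} - (r_i - r_j) \}
        \left\{ \frac{z_i - Z}{r_i} - \frac{z_j - Z}{r_j} \right\}    (8)

    r_i = \sqrt{(x_i - X)^2 + (y_i - Y)^2 + (z_i - Z)^2}    (9)

    r_j = \sqrt{(x_j - X)^2 + (y_j - Y)^2 + (z_j - Z)^2}    (10)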

As described above, by repeatedly calculating Eq. (4), it is possible to
obtain the estimated sound source position q = (X, Y, Z) where the error is
minimized.
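
The successive correction of Eq. (4) can be sketched as follows. This is a
minimal illustration under stated assumptions: the gradient is taken as
the derivative of the sum of squared errors between d_ij and r_i - r_j, the
step size 0.01, the stopping threshold and the cubic microphone layout are
hypothetical values, and d is a dictionary keyed by the pair (i, j).

```python
import math

def distance(p, q):
    """Euclidean distance between two 3-D points (Eqs. (9) and (10))."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def gradient(mics, d, q):
    """Gradient of e(q) = sum_{i<j} (d_ij - (r_i - r_j))^2 with respect to q."""
    r = [distance(m, q) for m in mics]
    g = [0.0, 0.0, 0.0]
    for i in range(len(mics) - 1):
        for j in range(i + 1, len(mics)):
            err = d[(i, j)] - (r[i] - r[j])
            for axis in range(3):
                g[axis] += 2.0 * err * ((mics[i][axis] - q[axis]) / r[i]
                                        - (mics[j][axis] - q[axis]) / r[j])
    return g

def estimate_position(mics, d, q0, alpha=0.01, c_th=1e-9, max_iter=5000):
    """Repeat Eq. (4) until every gradient element is smaller than c_th."""
    q = list(q0)
    for _ in range(max_iter):
        g = gradient(mics, d, q)
        if all(abs(v) < c_th for v in g):
            break
        q = [qa - alpha * ga for qa, ga in zip(q, g)]
    return q

# Hypothetical layout: eight microphones on the corners of a 2 m cube.
mics = [(x, y, z) for x in (0.0, 2.0) for y in (0.0, 2.0) for z in (0.0, 2.0)]
true_pos = (0.8, 1.1, 0.9)
r_true = [distance(m, true_pos) for m in mics]
d = {(i, j): r_true[i] - r_true[j] for i in range(8) for j in range(i + 1, 8)}
estimate = estimate_position(mics, d, q0=(1.0, 1.0, 1.0))
```

The error surface is not convex in general, so the result depends on the
initial value and on the step size; the values here were chosen for a
source inside the array.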

Fig. 3 illustrates in block form the functional configuration of the
sound source position detecting part 15. In this example, the sound source
position detecting part 15 comprises a delay time difference measuring part
15A, a multiplier 15B, a distance calculating part 15C, a mean square error
calculating part 15D, a gradient calculating part 15E, a relative decision part
15F, and an estimated position updating part 15G.

The delay time difference measuring part 15A measures, during
utterance from one speech sound source 9_k, the delay time difference
\tau_{ij} by the cross-correlation scheme for every pair (i, j) of
    i = 1, 2, ..., M-1;
    j = i+1, i+2, ..., M
based on the signals received by the microphones 11_i and 11_j. The
multiplier 15B multiplies each measured delay time difference \tau_{ij} by the
sound velocity c to obtain the difference in distance, d_{ij}, between the
sound source and the microphones 11_i and 11_j. The distance calculating
part 15C calculates, by Eqs. (9) and (10), the distances r_i and r_j between
the estimated sound source position (X, Y, Z) fed from the estimated position
updating part 15G and the microphones 11_i and 11_j. In this case, however,
the estimated position updating part 15G provides an arbitrary initial value
(X_0, Y_0, Z_0) as a first estimated sound source position to the distance
calculating part 15C. The mean square error calculating part 15D uses
d_{ij}, r_i and r_j to calculate the mean square error e(q) by Eq. (3) for all
of the above-mentioned pairs (i, j). The gradient calculating part 15E uses
the current estimated sound source position and d_{ij}, r_i, r_j to calculate
the gradient grad e(q) of the mean square error e(q) by Eqs. (6), (7) and (8).

The relative decision part 15F compares each element of the gradient
grad e(q) of the mean square error with a predetermined threshold value Cth to
decide whether every element is smaller than the threshold value Cth, and if
so, outputs the estimated sound source position (X, Y, Z) at that time. If
not every element is smaller than Cth, the estimated position updating part
15G uses the gradient grad e(q) and the current estimated position q =
(X, Y, Z) to update the estimated position by Eq. (4), and provides the
updated estimated position q^{(u+1)} = (X, Y, Z) to the distance calculating
part 15C. The distance calculating part 15C uses the updated estimated
position (X, Y, Z) and d_{ij} to calculate updated r_i and r_j in the same
manner as referred to previously; thereafter, the mean square error
calculating part 15D updates e(q), then the gradient calculating part 15E
calculates the updated grad e(q), and the relative decision part 15F decides
whether every element of the updated gradient is smaller than the threshold
value Cth.

In this way, updating of the estimated position (X, Y, Z) is repeated
until every element of the gradient grad e(q) of the mean square error
becomes sufficiently small (smaller than Cth), thereby estimating the position
(X, Y, Z) of the sound source 9_k. Similarly, positions of other sound sources
are estimated.

The frequency domain converting part 16 converts the signal acquired
by each microphone to a frequency domain signal. For example, the
sampling frequency of the acquired signal is 16 kHz, and acquired signal
samples from each microphone 11_m (m = 1, ..., M) are subjected to FFT (Fast
Fourier Transform) processing every frame of 256 samples to obtain the same
number of frequency domain signal samples Xm(ω).

Next, the covariance matrix calculating part 17 calculates the
covariance of the microphone acquired signals and generates a covariance
matrix. Letting X1(ω) to XM(ω) represent the frequency domain converted
signals of the microphone acquired signals obtained by the frequency domain
converting part 16 for each sound source 9_k, an MxM covariance matrix
RXX(ω) of these signals is generally expressed by the following equation (11).

    R_{XX}(\omega)
      = \begin{pmatrix} X_1(\omega) \\ \vdots \\ X_M(\omega) \end{pmatrix}
        \begin{pmatrix} X_1(\omega)^* & \cdots & X_M(\omega)^* \end{pmatrix}
      = \begin{pmatrix}
          X_1(\omega)X_1(\omega)^* & X_1(\omega)X_2(\omega)^* & \cdots & X_1(\omega)X_M(\omega)^* \\
          X_2(\omega)X_1(\omega)^* & X_2(\omega)X_2(\omega)^* & \cdots & X_2(\omega)X_M(\omega)^* \\
          \vdots                   & \vdots                   &        & \vdots                   \\
          X_M(\omega)X_1(\omega)^* & X_M(\omega)X_2(\omega)^* & \cdots & X_M(\omega)X_M(\omega)^*
        \end{pmatrix}    (11)

where * represents a complex conjugate.
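
For a single frequency, Eq. (11) is the outer product of the channel
spectra with their complex conjugates. A minimal sketch, assuming the M
spectral values are held in a plain Python list and using a hypothetical
function name:

```python
def covariance_matrix(x):
    """Return the M x M matrix R with R[i][j] = X_i * conj(X_j) (Eq. (11)).

    x: list of the M complex spectral values X_1(w), ..., X_M(w) at one frequency.
    """
    return [[xi * xj.conjugate() for xj in x] for xi in x]

# Hypothetical two-channel spectrum at one frequency.
r_xx = covariance_matrix([1 + 2j, 3 - 1j])
```

The diagonal holds the per-channel power and the matrix is Hermitian, that
is, R[j][i] is the complex conjugate of R[i][j].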

Next, the covariance matrix storage part 18 stores, based on the result
of detection by the sound source position detecting part 15, the covariance
matrix RXX(ω) as an MxM covariance matrix RSkSk(ω) for each sound source
9_k.

Letting Ak(ω) = (ak1(ω), ..., akM(ω)) represent the weighted mixing
vectors for M-channel acquired signals for each sound source 9_k, the acquired
sound level estimating part 19 calculates the acquired sound level PSk for each
sound source by the following equation (12) using the covariance matrix
RSkSk(ω) of the acquired signal for each sound source 9_k stored in the
covariance matrix storage part 18.

    P_{Sk} = \frac{1}{W} \sum_{\omega=0}^{W} A_k(\omega)^H R_{SkSk}(\omega) A_k(\omega)    (12)

In the above, the weighted mixing vector is expressed as a vector
Ak(ω) = (ak1(ω), ..., akM(ω)) that has controllable frequency characteristics,
but if no frequency characteristics control is effected, the elements of the
vector Ak may be preset values ak1, ak2, ..., akM. For example, the elements
of the weighted mixing vector Ak for each sound source 9_k are given greater
values as the microphones corresponding to the elements become closer to the
sound source 9_k. In the extreme, it is possible to set 1 for only the element
corresponding to the microphone 11_m closest to the sound source 9_k and set 0
for all the other elements, like Ak = (0, ..., 0, akm = 1, 0, ..., 0). In the
following description, ak1(ω), ..., akM(ω) will be expressed simply as
ak1, ..., akM, for the sake of brevity.

H in Eq. (12) represents a complex conjugate transpose, and
Ak(ω)^H RSkSk(ω) Ak(ω) can be expanded as given by the following equation.

    A_k(\omega)^H R_{SkSk}(\omega) A_k(\omega)
      = a_{k1}^* ( a_{k1} X_1(\omega) X_1(\omega)^* + a_{k2} X_2(\omega) X_1(\omega)^* + \cdots + a_{kM} X_M(\omega) X_1(\omega)^* )
      + a_{k2}^* ( a_{k1} X_1(\omega) X_2(\omega)^* + a_{k2} X_2(\omega) X_2(\omega)^* + \cdots + a_{kM} X_M(\omega) X_2(\omega)^* )
      + \cdots
      + a_{kM}^* ( a_{k1} X_1(\omega) X_M(\omega)^* + a_{k2} X_2(\omega) X_M(\omega)^* + \cdots + a_{kM} X_M(\omega) X_M(\omega)^* )
      = \Omega(\omega)    (13)

Eq. (12) means that the mean power PSk of the acquired signal is calculated by
adding the power spectrum sample value represented by Ω(ω) given by Eq.
(13) for bands 0 to W (sample number) of the frequency domain signal
generated by the frequency domain converting part 16 and then dividing the
added value by W.
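
The level estimate of Eq. (12) can be sketched as follows. This is a
simplified illustration: the weights are assumed to be real-valued, the
quadratic form is evaluated as the standard product A^H R A, and the sum is
divided by the number of bins actually supplied rather than by a fixed W.
The function names and sample values are hypothetical.

```python
def quadratic_form(a, r):
    """Evaluate A^H R A for a weight list a (length M) and an M x M matrix r."""
    m = len(a)
    return sum(a[i].conjugate() * r[i][j] * a[j]
               for i in range(m) for j in range(m))

def acquired_sound_level(a_per_bin, r_per_bin):
    """Average A_k(w)^H R_SkSk(w) A_k(w) over the supplied frequency bins."""
    total = sum(quadratic_form(a, r) for a, r in zip(a_per_bin, r_per_bin))
    return (total / len(r_per_bin)).real

# Two bins of a two-channel source; only the first (closest) microphone is weighted.
x_bins = [[1 + 2j, 3 - 1j], [2 + 0j, 1 + 0j]]
r_bins = [[[xi * xj.conjugate() for xj in x] for xi in x] for x in x_bins]
a_bins = [[1.0, 0.0], [1.0, 0.0]]
p_sk = acquired_sound_level(a_bins, r_bins)
```

With the weight vector (1, 0) the estimate reduces to the mean power of
the first channel, which corresponds to the extreme case in which only the
closest microphone is used.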

For example, assuming that the microphone 11_1 is the closest to the
sound source 9_1, the value of the weighting factor ak1 is so determined as to
assign the maximum weight to the acquired signal by the microphone 11_1 (a
first channel), and the values of weighting factors ak2, ak3, ..., akM for the
acquired signals of the other channels are determined smaller than ak1. With
such a weighting scheme, it is possible to increase the S/N of the acquired
signal from the sound source 9_1 or lessen the influence of room reverberation
more than in the case where such weighting is not performed. That is, the
optimum value of the weighting factor of the weighted mixing vector Ak(ω)
for each sound source 9_k is predetermined experimentally by the directivity
and layout of microphones and the layout of sound sources in such a manner
as to increase the S/N of the output speech signal corresponding to the sound
source 9_k, for example, and decrease the room reverberation. According to
the present invention, however, even when equal weighting is done in all the
channels, acquired signals from the respective sound sources can be
controlled to a desired level.

Next, the filter coefficient calculating part 21 calculates filter
coefficients for acquiring speech sound from each sound source at a desired
volume. In the first place, let H1(ω) to HM(ω) represent frequency domain
converted versions of filter coefficients of the filters 12_1 to 12_M each
connected to one of the microphones. Next, let H(ω) represent a matrix of
these filter coefficients by the following equation (14).

    H(\omega) = \begin{pmatrix} H_1(\omega) \\ \vdots \\ H_M(\omega) \end{pmatrix}    (14)

Further, let XSk,1 to XSk,M represent frequency domain converted
signals of the signals acquired by respective microphones during utterance of
the k-th sound source 9_k.

In this case, the condition that the filter coefficient matrix H(ω) needs
to satisfy is that when the microphone acquired signals are subjected to
filtering with the filter coefficient matrix H(ω) and the filtered signals are
added together, the signal component from each sound source has a desired
level Popt. Accordingly, the following equation (15) is an ideal condition by
which the signal obtained by adding the filtered signals from the sound source
9_k becomes equivalent to a signal obtained by multiplying the weighted
mixing vector Ak(ω) for the acquired signals from the microphones 11_1 to
11_M by a desired gain.

    ( X_{Sk,1}(\omega) \cdots X_{Sk,M}(\omega) ) H(\omega)
      = \sqrt{\frac{P_{opt}}{P_{Sk}}} ( X_{Sk,1}(\omega) \cdots X_{Sk,M}(\omega) ) A_k(\omega)    (15)

where k = 1, ..., K, K representing the number of sound sources.
Next, solving the condition of Eq. (15) by the least square method for
the filter coefficient matrix H(ω) gives the following equation (16).

    H(\omega) = \left\{ \sum_{k=1}^{K} C_{Sk} R_{SkSk}(\omega) \right\}^{-1}
                \sum_{k=1}^{K} C_{Sk} \sqrt{\frac{P_{opt}}{P_{Sk}}} R_{SkSk}(\omega) A_k(\omega)    (16)

where CSk is a weighting factor that imposes a sensitivity restraint on the k-th
sound source position. The sensitivity restraint mentioned herein means
flattening the frequency characteristics of the present sound acquisition
apparatus with respect to the sound source position. An increase in this
value increases the sensitivity restraint on the sound source concerned,
permitting sound acquisition with flatter frequency characteristics but
increasing deterioration of frequency characteristics for other sound source
positions. Hence, it is preferable that CSk be normally set at a value
approximately in the range of 0.1 to 10 to impose well-balanced restraints on
all the sound sources.
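
Evaluating Eq. (16) at one frequency can be sketched as follows. This is
only an illustration under stated assumptions: the matrix inverse is
replaced by solving the equivalent linear system with a small Gaussian
elimination routine, each stored covariance matrix is built from a single
frame, so the summed matrix is invertible only when the sources are at
least as numerous as the microphones, and all names and numbers are
hypothetical.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            f = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= f * aug[col][k]
    x = [0j] * n
    for row in range(n - 1, -1, -1):
        s = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - s) / aug[row][row]
    return x

def filter_coefficients(r_list, a_list, p_list, c_list, p_opt):
    """Return H = {sum C_Sk R_k}^-1 sum C_Sk sqrt(Popt/P_Sk) R_k A_k at one frequency."""
    m = len(a_list[0])
    lhs = [[0j] * m for _ in range(m)]
    rhs = [0j] * m
    for r, a, p_s, c_s in zip(r_list, a_list, p_list, c_list):
        gain = math.sqrt(p_opt / p_s)
        for i in range(m):
            for j in range(m):
                lhs[i][j] += c_s * r[i][j]
                rhs[i] += c_s * gain * r[i][j] * a[j]
    return solve(lhs, rhs)

# Two sources seen by two microphones: source 1 is loud, source 2 is weak.
x1, x2 = [2 + 0j, 1j], [0.1 + 0j, 0.5 + 0j]
r1 = [[xi * xj.conjugate() for xj in x1] for xi in x1]
r2 = [[xi * xj.conjugate() for xj in x2] for xi in x2]
h = filter_coefficients([r1, r2], [[1.0, 0.0], [0.0, 1.0]],
                        p_list=[4.0, 0.25], c_list=[1.0, 1.0], p_opt=1.0)
out1 = sum(x.conjugate() * hm for x, hm in zip(x1, h))
out2 = sum(x.conjugate() * hm for x, hm in zip(x2, h))
```

In this toy case the loud source is attenuated and the weak one amplified,
so that both outputs have power Popt. Averaging the covariance matrices over
many frames would normally make the summed matrix well conditioned.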

Fig. 4 illustrates in block form the functional configuration of the filter
coefficient calculating part 21 for calculating the filter coefficients expressed
by Eq. (16). In this example, the covariance matrices RS1S1 to RSKSK
corresponding to the respective sound sources 9_1 to 9_K, provided from the
covariance matrix storage part 18, are applied to multipliers 21A1 to 21AK,
wherein they are multiplied by weighting factors CS1 to CSK set by a
weighting factor setting part 21H, respectively. The acquired sound levels
PS1 to PSK for the sound sources 9_1 to 9_K, estimated by the acquired sound
level estimating part 19, are provided to square ratio calculating parts 21B1 to
21BK, wherein square ratios, (Popt/PS1)^{1/2} to (Popt/PSK)^{1/2}, between them
and the predetermined desired output level Popt are calculated, and the
calculated values are provided to multipliers 21C1 to 21CK for multiplication
by the outputs from multipliers 21A1 to 21AK, respectively. The outputs from
the multipliers 21C1 to 21CK are fed to multipliers 21D1 to 21DK, where they
are further multiplied by weighted mixing vectors A1(ω) to AK(ω), and a
matrix of the total sum of the multiplied outputs is calculated by an adder
21E. On the other hand, a matrix of the sum total of the outputs from the
multipliers 21A1 to 21AK is calculated by an adder 21F, and by an inverse
matrix multiplier 21G an inverse matrix of the matrix calculated by the adder
21F and the output from the adder 21E are multiplied to calculate the filter
coefficient H(ω).

Next, the filter coefficients H1(ω), H2(ω), ..., HM(ω) calculated by the
filter coefficient calculating part 21 are set in the filters 12_1 to 12_M for
filtering the acquired signals from the microphones 11_1 to 11_M,
respectively. The filtered signals are added together by the adder 13, from
which the added output is provided as an output signal.
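
The filtering and adding stage can be sketched in the frequency domain as
follows. This is a simplification made for illustration: each filter is
applied as a per-bin multiplication on one frame, whereas an actual
implementation may apply the coefficients in the time domain; the function
name and data are hypothetical.

```python
def filter_and_sum(x_bins, h_bins):
    """Return Y(w) = sum over m of H_m(w) * X_m(w) for every frequency bin w.

    x_bins[m][w]: spectrum of microphone m; h_bins[m][w]: its filter coefficient.
    """
    n_bins = len(x_bins[0])
    return [sum(h[w] * x[w] for h, x in zip(h_bins, x_bins))
            for w in range(n_bins)]

# Two channels, two bins: channel 1 is halved, channel 2 passes only bin 0.
y = filter_and_sum([[1.0, 2.0], [3.0, 4.0]],
                   [[0.5, 0.5], [1.0, 0.0]])
```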

A description will be given below of three examples of the usage of
the sound acquisition apparatus according to the present invention.

A first method begins, as shown in Fig. 5, with initial setting of the
number K of sound sources at K = 0 in step S1. This is followed by step S2
in which the state decision part 14 periodically makes a check for utterance,
and if utterance is detected, the sound source position detecting part 15
detects the position of the sound source concerned in step S3. In step S4 it is
decided whether the detected sound source position matches any one of those
previously detected, and if a match is found, the covariance matrix RXX(ω)
corresponding to that sound source position is newly calculated in the
covariance matrix calculating part 17 in step S5, and in step S6 the covariance
matrix in the corresponding area of the covariance matrix storage part 18 is
updated with the newly calculated covariance matrix.

When no match is found with the previously detected sound source
position in step S4, K is incremented by 1 in step S7, then in step S8 the
covariance matrix RXX(ω) corresponding to that sound source position is
newly calculated in the covariance matrix calculating part 17, and in step S9
the covariance matrix is stored in a new area of the covariance matrix storage
part 18.

Next, in step S10 the acquired sound level is estimated from the stored
covariance matrix in the acquired sound level estimating part 19, then in step
S11 the estimated acquired sound level and the covariance matrix are used to
calculate the filter coefficients H1(ω) to HM(ω) by the filter coefficient
calculating part 21, and in step S12 the filter coefficients set in the filters
12_1 to 12_M are updated with the newly calculated ones.

A second method begins, as shown in Fig. 6, with presetting the
maximum value of the sound source number at Kmax and presetting the initial
value of the sound source number K at 0 in step S1. The subsequent steps
S2 to S6 are the same as in the case with Fig. 5; that is, the microphone output
signals are checked for utterance, and if utterance is detected, then its sound
source position is detected, then it is decided whether the detected sound
source position matches any one of those previously detected, and if a match
is found, the covariance matrix corresponding to that sound source position is
calculated and stored as a newly updated matrix in the corresponding storage
area.

When it is found in step S4 that the detected sound source position
does not match any one of previously detected positions, K is incremented by
1 in step S7, and in step S8 a check is made to see if K is larger than the
maximum value Kmax. If it does not exceed the maximum value Kmax, then
the covariance matrix for the detected position is calculated in step S9, and in
step S10 the covariance matrix is stored in a new area. When it is found
in step S8 that K exceeds the maximum value Kmax, K=Kmax is set in step S11,
then in step S12 the least recently updated one of the covariance matrices
stored in the covariance matrix storage part 18 is erased, and a new
covariance matrix calculated by the covariance matrix calculating part 17 in
step S13 is stored in that area in step S14. The subsequent steps S15, S16
and S17 are the same as steps S10, S11 and S12 in Fig. 5; that is, the
estimated acquired sound level for each sound source is calculated from the
covariance matrix, and filter coefficients are calculated and set in the filters
12_1 to 12_M. This method is advantageous over the Fig. 5 method in that the
storage area of the covariance matrix storage part 18 can be reduced by
limiting the maximum value of the sound source number K to Kmax.
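
The storage policy of the second method can be sketched as follows. This is
only an illustration under stated assumptions, not part of the disclosed
apparatus: the class name CovarianceStore is hypothetical, a detected sound
source position is assumed to have been reduced already to a hashable key
(so that position matching is a dictionary lookup), and the covariance
matrix is treated as an opaque value.

```python
from collections import OrderedDict


class CovarianceStore:
    """Keeps at most k_max covariance matrices, one per sound source position."""

    def __init__(self, k_max):
        self.k_max = k_max
        self.areas = OrderedDict()  # position key -> covariance matrix

    def update(self, position, covariance):
        """Store a matrix for `position`; return the evicted position, if any."""
        evicted = None
        if position in self.areas:
            # Steps S5 and S6: overwrite the area of an already known source.
            self.areas[position] = covariance
        else:
            if len(self.areas) >= self.k_max:
                # Steps S11 and S12: erase the least recently updated area.
                evicted, _ = self.areas.popitem(last=False)
            # Steps S9/S10 or S13/S14: store the matrix in the free area.
            self.areas[position] = covariance
        # Mark this area as the most recently updated one.
        self.areas.move_to_end(position)
        return evicted
```

With k_max set to 2, updating positions "a", "b", "a" and then "c" in that
order erases "b", because "a" was refreshed more recently.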

In the first and second methods, described above, each detection of
speech sound is always accompanied by the calculation and storage of a
covariance matrix and updating of the filter coefficients, but the third method
described below does not involve updating of the filter coefficients when the
position of the sound source of the detected utterance matches any one of the
already detected sound source positions. Fig. 7 illustrates the procedure of
the third method. In step S1 the initial value of the sound source number K is
set to 0, then in step S2 the state decision part 14 periodically makes a check
for utterance, and if utterance is detected, the sound source position detecting
part 15 detects the position of the sound source of the detected utterance in
step S3. In step S4 it is decided whether the detected sound source position
matches any one of the already detected sound source positions, and if a
match is found, the procedure returns to step S2 without updating. If no
match is found with any one of the already detected sound source positions in
step S4, that is, if the sound source 9k moves to a position different from that
where it was previously, or if a new sound source is added, K is incremented
by 1 in step S5, then in step S6 the covariance matrix RSkSk(ω) corresponding
to the sound source is newly calculated in the covariance matrix calculating
part 17, and in step S7 it is stored in the corresponding new area MAk of the
covariance matrix storage part 18, then in step S8 the covariance matrix is
used to estimate the acquired sound level by the acquired sound level
estimating part 19, then in step S9 all the covariance matrices and estimated
acquired sound levels are used to calculate updated filter coefficients by the
filter coefficient calculating part 21, and in step S10 the updated filter
coefficients are set in the filters 12_1 to 12_M, followed by a return to
step S2.

As described above, according to the present invention, the sound
source positions are estimated from the acquired signals of a plurality of
microphones, then a covariance matrix of the acquired signal is calculated for
each sound source, then filter coefficients for adjusting the sound volume for
each sound source position are calculated, and the filter coefficients are used
to filter the acquired signals of the microphones, by which it is possible to
obtain an output signal of a volume adjusted for each speaker's position.

While the Fig. 1 embodiment has been described with reference to the
case where the sound source position detecting part 15 estimates the
coordinate position of each sound source 9k, it is also possible to calculate the
sound source direction, that is, the angular position of each sound source 9k
relative to the arrangement of the microphones 11_1 to 11_M. A method for
estimating the sound source direction is set forth, for example, in Tanaka,
Kaneda, and Kojima, "Performance Evaluation of a Sound Source Direction
Estimating Method under Room Reverberation," Journal of the Society of
Acoustic Engineers of Japan, Vol. 50, No. 7, 1994, pp. 540-548. In short, a
covariance matrix of acquired signals needs only to be calculated for each
sound source and stored.

SECOND EMBODIMENT 

Fig. 8 is a functional block diagram of a sound acquisition apparatus 
according to a second embodiment of the present invention. 

The sound acquisition apparatus of this embodiment comprises
microphones 11_1 to 11_M, filters 12_1 to 12_M, an adder 13, a state decision
part 14, a sound source position detecting part 15, a frequency domain
converting part 16, a covariance matrix calculating part 17, a covariance
matrix storage part 18, an acquired sound level estimating part 19, and a
filter coefficient calculating part 21.

This embodiment adds an effect of noise reduction to the acquired
sound level adjustment of the sound acquisition apparatus according to the
first embodiment of the present invention.

In the first place, the state decision part 14 detects an utterance period
and a noise period from the power of the signals acquired by the
microphones 11_1 to 11_M. The state decision part 14 includes, as shown in
Fig. 9, a noise decision part 14F added to the state decision part 14 of Fig. 2.
For example, as is the case with the first embodiment, a short-time mean
power PavS and a long-time mean power PavL are calculated by the short-time
mean power calculating part 14B and the long-time mean power calculating
part 14C for the acquired signals from the respective microphones, then the
ratio, Rp=PavS/PavL, between the short-time mean power and the long-time
mean power is calculated in the division part 14D, then the ratio is compared
with an utterance threshold value PthU in the utterance decision part 14E, and
if the power ratio exceeds the threshold value, it is decided as indicating the
presence of the utterance period. The noise decision part 14F compares the
power ratio Rp with a noise threshold value PthN, and if the power ratio is
smaller than the threshold value, it is decided as indicating the presence of the
noise period.
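
The two threshold comparisons can be sketched as follows. This is a minimal
illustration under assumptions: the function names, the window lengths
short_len and long_len, and the "undecided" result for ratios lying between
the two thresholds are hypothetical and are not specified in the text; the
mean power is taken as the mean of the squared samples in each window.

```python
def mean_power(samples):
    """Mean of the squared samples; an empty window has zero power."""
    return sum(s * s for s in samples) / len(samples) if samples else 0.0


def decide_state(signal, short_len, long_len, p_th_u, p_th_n):
    """Classify the latest frame as 'utterance', 'noise' or 'undecided'."""
    p_av_s = mean_power(signal[-short_len:])   # short-time mean power PavS
    p_av_l = mean_power(signal[-long_len:])    # long-time mean power PavL
    if p_av_l == 0.0:
        return "undecided"                     # silent input, ratio undefined
    r_p = p_av_s / p_av_l                      # power ratio Rp = PavS / PavL
    if r_p > p_th_u:
        return "utterance"                     # utterance decision part 14E
    if r_p < p_th_n:
        return "noise"                         # noise decision part 14F
    return "undecided"
```

A sudden rise in level makes the short-time power exceed the long-time power,
so the ratio rises above PthU; a steady or falling level keeps the ratio near
or below 1, so a noise threshold below 1 separates the noise period.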

When the result of decision by the utterance decision part 14E is
indicative of the utterance period, the sound source position detecting part 15
detects the position of the sound source concerned in the same way as in the
first embodiment of the present invention.

Next, the frequency domain converting part 16 converts acquired
signals from the microphones 11_1 to 11_M in the utterance period and in the
noise period of each sound source 9k into frequency domain signals, and
provides them to the covariance matrix calculating part 17. The covariance
matrix calculating part 17 calculates a covariance matrix RSkSk(ω) of the
frequency domain acquired signals for the sound source 9k in the same
manner as in the first embodiment of the present invention. Further, the
covariance matrix calculating part 17 calculates a covariance matrix RNN(ω)
of the frequency domain acquired signals in the noise period.

The covariance matrix storage part 18 stores, based on the result of
detection by the sound source position detecting part 15 and the result of
decision by the state decision part 14, the covariance matrices RSkSk(ω) in the
utterance period and the covariance matrix RNN(ω) in the noise period for
the sound sources 9_1, ..., 9_K in areas MA1, ..., MAK, MAK+1, respectively.

The acquired sound level estimating part 19 estimates the acquired
sound level PSk for each sound source in the same manner as in the first
embodiment of the present invention.

Next, the filter coefficient calculating part 21 calculates filter
coefficients for acquiring sound from each sound source 9k at a desired
volume and for reducing noise. In the first place, the condition for noise
reduction is calculated. Let the frequency domain converted signals of the
microphone acquired signals in the noise period be represented by XN,1(ω) to
XN,M(ω). If the microphone acquired signals XN,1(ω) to XN,M(ω) in the noise
period become zero after passing through the filters 12_1 to 12_M and the adder
13, this means that noise could be reduced; hence, the condition for noise
reduction is given by the following equation (17).

$$\bigl(X_{N,1}(\omega),\ \ldots,\ X_{N,M}(\omega)\bigr)\, H(\omega) = 0 \tag{17}$$

By satisfying both of Eq. (17) and Eq. (15) for adjusting the acquired 
sound level, mentioned previously in the first embodiment of the present 
invention, it is possible to implement both of the acquired sound level 
adjustment and the noise reduction. 

Next, solving the conditions of Eqs. (15) and (17) by the least squares
method for the filter coefficient matrix H(ω) gives the following equation
(18).

$$H(\omega) = \Bigl\{ \sum_{k=1}^{K} C_{Sk} R_{SkSk}(\omega) + C_{N} R_{NN}(\omega) \Bigr\}^{-1} \sum_{k=1}^{K} C_{Sk} \sqrt{\frac{P_{opt}}{P_{Sk}}}\, R_{SkSk}(\omega) A_{k}(\omega) \tag{18}$$

CN is a weight constant for the noise reduction rate; an increase in the value of
the constant increases the noise reduction rate. But, since an increase in CN
decreases the sensitivity constraint for the sound source position and increases
degradation of the frequency characteristics of the acquired sound signal, CN
is normally set at an appropriate value approximately in the range of 0.1 to
10.0. The meanings of the other symbols are the same as in the first
embodiment.
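
The evaluation of Eq. (18) at a single frequency can be sketched with NumPy
as follows. This is an illustration under assumptions, not the disclosed
implementation: the function name and argument layout are hypothetical, the
covariance matrices are assumed to be available as M x M arrays, and the
linear system is solved directly instead of forming the inverse matrix
explicitly, which gives the same result when the bracketed sum is
invertible.

```python
import numpy as np


def filter_coefficients(r_ss, r_nn, a, p_s, p_opt, c_s, c_n):
    """Evaluate Eq. (18) for a single frequency bin.

    r_ss: list of K (M x M) covariance matrices, one per sound source
    r_nn: (M x M) covariance matrix of the noise period
    a:    list of K length-M weighted mixing vectors
    p_s:  list of K estimated acquired sound levels
    c_s:  list of K sensitivity-constraint weights
    """
    lhs = c_n * r_nn
    rhs = np.zeros(r_nn.shape[0], dtype=complex)
    for r_k, a_k, p_k, c_k in zip(r_ss, a, p_s, c_s):
        lhs = lhs + c_k * r_k
        rhs = rhs + c_k * np.sqrt(p_opt / p_k) * (r_k @ a_k)
    # Solve lhs @ h = rhs rather than forming the inverse explicitly.
    return np.linalg.solve(lhs, rhs)
```

For one source with an identity covariance matrix, PS1=1 and Popt=4, the gain
factor is 2; adding noise power on the second microphone only lowers the
coefficient of that microphone, which is the intended trade-off controlled
by CN.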

Next, the filter coefficients calculated by Eq. (18) are set in the filters
12_1 to 12_M and used to filter the microphone acquired signals. The filtered
signals are added together by the adder 13, from which the added signal is
provided as the output signal.

As described above, the second embodiment of the present invention 
permits reduction of noise in addition to the effect of acquired sound level 
adjustment in the first embodiment of the present invention. 

The other parts of this embodiment are the same as in the first
embodiment of the present invention, and hence they will not be described.
THIRD EMBODIMENT 

Fig. 10 is a functional block diagram of a sound acquisition apparatus 
according to a third embodiment of the present invention. 

The sound acquisition apparatus of this embodiment comprises a
loudspeaker 22, microphones 11_1 to 11_M, filters 12_1 to 12_M and 23, an adder
13, a state decision part 14, a sound source position detecting part 15, a
frequency domain converting part 16, a covariance matrix calculating part 17,
a covariance matrix storage part 18, an acquired sound level estimating part
19, and a filter coefficient calculating part 21.

This embodiment adds, to the sound acquisition apparatus of the
second embodiment of the present invention, the loudspeaker 22 for
reproducing a received signal from a participating speaker at a remote
location and the filter 23 for filtering the received signal, with a view to
implementing, in addition to the acquired sound level adjustment and the
noise reduction by the second embodiment, cancellation of acoustic echoes
that are loudspeaker reproduced signal components which are acquired by the
microphones 11_1 to 11_M.

The state decision part 14 has, in addition to the Fig. 4 configuration
of the state decision part 14, as shown in Fig. 11: a short-time mean power
calculating part 14B' and a long-time mean power calculating part 14C' for
calculating short-time mean power P'avS and long-time mean power P'avL of
the received signal, respectively; a division part 14D' for calculating their
ratio R'p=P'avS/P'avL; a receive decision part 14G that compares the ratio R'p
with a predetermined receive signal threshold value RthR and, if the former is
larger than the latter, decides the state as a receiving period; and a state
determining part 14H that determines the state based on the results of decision
by the utterance decision part 14E, the noise decision part 14F and the receive
decision part 14G. When the result of decision by the receive decision part
14G is the receiving period, the state determining part 14H determines the
state as the receiving period, irrespective of the results of decision by the
utterance decision part 14E and the noise decision part 14F, whereas when the
receive decision part 14G decides that the state is not the receiving period, the
state determining part 14H determines the state as the utterance or noise period
according to the decisions by the utterance decision part 14E and the noise
decision part 14F as in the case of Fig. 4.
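
The priority given to the receive decision can be sketched as follows. This
is a minimal illustration under assumptions: the function name, the argument
names, and the "undecided" result for a ratio lying between the utterance and
noise thresholds are hypothetical; the power ratios Rp and R'p are assumed to
have been calculated beforehand.

```python
def determine_state(r_p, r_p_recv, p_th_u, p_th_n, r_th_r):
    """Return 'receiving', 'utterance', 'noise' or 'undecided'.

    r_p:      power ratio Rp of the microphone acquired signals
    r_p_recv: power ratio R'p of the received signal
    """
    if r_p_recv > r_th_r:
        # Receive decision part 14G wins regardless of 14E and 14F.
        return "receiving"
    if r_p > p_th_u:
        return "utterance"   # utterance decision part 14E
    if r_p < p_th_n:
        return "noise"       # noise decision part 14F
    return "undecided"       # hypothetical label, not in the text
```

Checking the received signal first keeps loudspeaker output picked up by the
microphones from being classified as an utterance of a local sound source.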

When the result of decision by the state decision part 14 is the
utterance period, the sound source position detecting part 15 detects the
position of the sound source concerned in the same manner as in the first
embodiment of the present invention.

Next, the frequency domain converting part 16 converts the
microphone acquired signals and the received signal to frequency domain
signals X1(ω), ..., XM(ω) and Z(ω), and the covariance matrix calculating part
17 calculates covariance matrices of the frequency domain acquired signals
and received signal. A covariance matrix RXX(ω) of the frequency domain
converted signals X1(ω) to XM(ω) of the microphone acquired signals and the
frequency domain converted signal Z(ω) is calculated by the following
equation (19).

$$R_{XX}(\omega) = \begin{pmatrix} Z(\omega) \\ X_{1}(\omega) \\ \vdots \\ X_{M}(\omega) \end{pmatrix} \bigl( Z(\omega)^{*}\ \ X_{1}(\omega)^{*}\ \cdots\ X_{M}(\omega)^{*} \bigr) \tag{19}$$

where * represents a complex conjugate.
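
The outer product of Eq. (19) can be sketched with NumPy as follows. This is
an illustration under assumptions: the function names are hypothetical, and
the averaging over several frames in averaged_covariance is an assumption
commonly made when estimating a covariance matrix; Eq. (19) itself shows only
the product for one set of frequency domain signals.

```python
import numpy as np


def covariance_matrix(z, x):
    """Outer product of Eq. (19) for one frequency bin.

    z: complex frequency domain received signal Z(w)
    x: length-M array of microphone signals X_1(w) .. X_M(w)
    """
    v = np.concatenate(([z], x))      # stacked column (Z, X_1, ..., X_M)
    return np.outer(v, np.conj(v))    # (M+1) x (M+1) Hermitian matrix


def averaged_covariance(frames):
    """Average the per-frame outer products over (z, x) pairs (assumption)."""
    return sum(covariance_matrix(z, x) for z, x in frames) / len(frames)
```

The resulting matrix has M+1 rows and columns, which is why the weighted
mixing vectors and the filter coefficient matrix of this embodiment are
composed of M+1 elements.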

Next, in the covariance matrix storage part 18, based on the result of
detection by the sound source position detecting part 15 and on the result of
decision by the state decision part 14, the covariance matrix RXX(ω) is stored
as covariance matrices RSkSk(ω) of the acquired signals and the received
signal for each sound source 9k in the utterance period, as a covariance matrix
RNN(ω) of the acquired signals and the received signal in the noise period, and
as a covariance matrix REE(ω) of the acquired signals and the received signal
in the receiving period in areas MA1, ..., MAK, MAK+1, respectively.

The acquired sound level estimating part 19 calculates the acquired
sound level PSk for each sound source 9k by the following equation (20) based
on the covariance matrices RS1S1, ..., RSKSK for each sound source and
predetermined weighted mixing vectors A1(ω), ..., AK(ω) composed of M+1
elements for each sound source.

$$P_{Sk} = \frac{1}{W} \sum_{\omega=0}^{W} A_{k}(\omega)^{H} R_{SkSk}(\omega) A_{k}(\omega) \tag{20}$$

Next, the filter coefficient calculating part 21 calculates filter
coefficients for acquiring, at a desired volume, speech sound uttered from
each sound source. Let H1(ω) to HM(ω) represent frequency domain
converted versions of the filter coefficients of the filters 12_1 to 12_M
connected to the microphones, respectively, and let F(ω) represent a frequency
domain converted version of the filter coefficient of the filter 23 for filtering
the received signal. Then, let H(ω) represent a matrix of these filter
coefficients given by the following equation (21).

$$H(\omega) = \begin{pmatrix} F(\omega) \\ H_{1}(\omega) \\ \vdots \\ H_{M}(\omega) \end{pmatrix} \tag{21}$$



Further, let XE,1(ω) to XE,M(ω) represent frequency domain converted
signals of the microphone acquired signals in the receiving period; let ZE(ω)
represent a frequency domain converted signal of the received signal in the
receiving period; let XN,1(ω) to XN,M(ω) represent frequency domain converted
signals of the microphone acquired signals in the noise period; let ZN(ω)
represent a frequency domain converted signal of the received signal in the
noise period; let XSk,1(ω) to XSk,M(ω) represent frequency domain converted
signals of the microphone acquired signals in the utterance period of the
k-th sound source 9k; and let ZSk(ω) represent a frequency domain converted
signal of the received signal in that utterance period.

In this case, the condition that the filter coefficient matrix H(ω) needs
to meet is that when the microphone acquired signals and the received signal
are each subjected to filtering with the filter coefficient matrix H(ω) and the
signals after filtering are added together, acoustic echo and noise signals are
cancelled and only the send speech signal is sent at a desired level.

Accordingly, for the signals during the receiving period and the noise
period, the following equations (22) and (23) are ideal conditions by which
the filtered and added signals become 0.

$$\bigl( Z_{E}(\omega)\ \ X_{E,1}(\omega)\ \cdots\ X_{E,M}(\omega) \bigr)\, H(\omega) = 0 \tag{22}$$

$$\bigl( Z_{N}(\omega)\ \ X_{N,1}(\omega)\ \cdots\ X_{N,M}(\omega) \bigr)\, H(\omega) = 0 \tag{23}$$

For the signal during the utterance period, the following equation is an ideal
condition by which the filtered and added signal becomes equivalent to a
signal obtained by multiplying the microphone acquired signals and the
received signal by the weighted mixing vector Ak(ω) composed of
predetermined M+1 elements and a desired gain.

$$\bigl( Z_{Sk}(\omega)\ \ X_{Sk,1}(\omega)\ \cdots\ X_{Sk,M}(\omega) \bigr)\, H(\omega) = \sqrt{\frac{P_{opt}}{P_{Sk}}}\, \bigl( Z_{Sk}(\omega)\ \ X_{Sk,1}(\omega)\ \cdots\ X_{Sk,M}(\omega) \bigr)\, A_{k}(\omega) \tag{24}$$

The element a0(ω) of the weighted mixing vector Ak(ω)=(a0(ω), ak1(ω), ...,
akM(ω)) represents a weighting factor for the received signal; normally, it is
set at a0(ω)=0.

Next, solving the conditions of Eqs. (22) to (24) by the least squares
method for the filter coefficient matrix H(ω) gives the following equation:

$$H(\omega) = \Bigl\{ \sum_{k=1}^{K} C_{Sk} R_{SkSk}(\omega) + C_{N} R_{NN}(\omega) + C_{E} R_{EE}(\omega) \Bigr\}^{-1} \sum_{k=1}^{K} C_{Sk} \sqrt{\frac{P_{opt}}{P_{Sk}}}\, R_{SkSk}(\omega) A_{k}(\omega) \tag{25}$$

CE is a weight constant for acoustic echo return loss enhancement; the larger
the value, the more the acoustic echo return loss enhancement increases. But,
an increase in the value CE accelerates deterioration of the frequency
characteristics of the acquired signal and lowers the noise reduction
characteristic. On this account, CE is usually set at an appropriate value
approximately in the range of 0.1 to 10.0. The meanings of the other
symbols are the same as in the second embodiment.

In this way, the filter coefficients can be determined in such a manner
as to adjust volume and reduce noise.

Next, the filter coefficients, obtained by Eq. (25), are set in the filters
12_1 to 12_M and 23, which filter the microphone acquired signals and the
received signal, respectively. The filtered signals are added together by the
adder 13, from which the added signal is output as the send signal. The
other parts are the same as in the second embodiment of the present invention,
and hence no description will be repeated.

As described above, the third embodiment of the present invention
permits implementation of acoustic echo cancellation in addition to the effects
of acquired sound level adjustment and noise reduction by the second
embodiment of the present invention. While the third embodiment has been
described as adding the acoustic echo cancellation capability to the second
embodiment, the acoustic echo cancellation capability may also be added to
the first embodiment. In such an instance, the noise decision part 14F is
removed in Fig. 11 showing in detail the state decision part 14 in Fig. 10, and
the covariance matrix calculating part 17 in Fig. 10 does not calculate the
covariance matrix RNN(ω) in the noise period. Accordingly, the calculation
of filter coefficients in the filter coefficient calculating part 21 may be carried
out by the following equation, which is evident from the foregoing
description.

$$H(\omega) = \Bigl\{ \sum_{k=1}^{K} C_{Sk} R_{SkSk}(\omega) + C_{E} R_{EE}(\omega) \Bigr\}^{-1} \sum_{k=1}^{K} C_{Sk} \sqrt{\frac{P_{opt}}{P_{Sk}}}\, R_{SkSk}(\omega) A_{k}(\omega) \tag{26}$$

FOURTH EMBODIMENT 
Though described above as an embodiment having added the acoustic
echo cancellation capability to the acquired sound level adjustment and noise
reduction capabilities of the second embodiment, the third embodiment of Fig.
10 may also be configured as a sound acquisition apparatus equipped with
only the noise reduction and acoustic echo cancellation capabilities. An
example of such a configuration is shown in Fig. 12.

As illustrated in Fig. 12, this embodiment has a configuration in which
the sound source position detecting part 15 and the acquired sound level
estimating part 19 in the Fig. 10 configuration are removed and the
covariance matrix calculating part 17 calculates a covariance matrix RSS(ω) of
the send signal, a covariance matrix REE(ω) of the received signal, and a
covariance matrix RNN(ω) of the noise signal, which are stored in storage
areas MAS, MAE and MAN of the covariance matrix storage part 18, respectively.
The acoustic echo cancellation capability can be implemented using at least
one microphone, but an example using M microphones is shown.

The state decision part 14 decides, as in the Fig. 10 embodiment, the
utterance period, the receiving period, and the noise period from the signals
acquired by the microphones 11_1 to 11_M and the received signal; the state
decision part is identical in concrete configuration and in operation with the
counterpart depicted in Fig. 11. The acquired signals and the received signal
are converted by the frequency domain converting part 16 to frequency
domain acquired signals X1(ω) to XM(ω) and a frequency domain received
signal Z(ω), which are provided to the covariance matrix calculating part 17.

Next, the covariance matrix calculating part 17 generates a covariance
matrix of the frequency domain acquired signals and received signal. The
covariance matrix RXX(ω) of the frequency domain converted signals X1(ω)
to XM(ω) of the microphone acquired signals and the frequency domain
converted signal Z(ω) of the received signal is calculated by the following
equation (27).

$$R_{XX}(\omega) = \begin{pmatrix} Z(\omega) \\ X_{1}(\omega) \\ \vdots \\ X_{M}(\omega) \end{pmatrix} \bigl( Z(\omega)^{*}\ \ X_{1}(\omega)^{*}\ \cdots\ X_{M}(\omega)^{*} \bigr) \tag{27}$$

where * represents a complex conjugate.

Next, in the covariance matrix storage part 18, based on the result of
detection by the state decision part 14, the covariance matrix RXX(ω) is stored
as a covariance matrix RSS(ω) of the acquired signals and the received signal
for each sound source 9k in the utterance period, as a covariance matrix
RNN(ω) of the acquired signals and the received signal in the noise period, and
as a covariance matrix REE(ω) of the acquired signals and the received signal
in the receiving period in areas MAS, MAN, and MAE, respectively.

Next, the filter coefficient calculating part 21 acquires speech sound
uttered from sound sources, and calculates filter coefficients for canceling
acoustic echo and noise. Let H1(ω) to HM(ω) represent frequency domain
converted versions of the filter coefficients of the filters 12_1 to 12_M
connected to the microphones 11_1 to 11_M, respectively, and let F(ω) represent
a frequency domain converted version of the filter coefficient of the filter 23
for filtering the received signal. Then, let H(ω) represent a matrix of these
filter coefficients given by the following equation (28).

$$H(\omega) = \begin{pmatrix} F(\omega) \\ H_{1}(\omega) \\ \vdots \\ H_{M}(\omega) \end{pmatrix} \tag{28}$$

Further, let XE,1(ω) to XE,M(ω) represent frequency domain converted
signals of the microphone acquired signals in the receiving period; let ZE(ω)
represent a frequency domain converted signal of the received signal in the
receiving period; let XN,1(ω) to XN,M(ω) represent frequency domain converted
signals of the microphone acquired signals in the noise period; let ZN(ω)
represent a frequency domain converted signal of the received signal in the
noise period; let XS,1(ω) to XS,M(ω) represent frequency domain converted
signals of the microphone acquired signals in the utterance period; and let
ZS(ω) represent a frequency domain converted signal of the received signal in
the utterance period.

In this case, the condition that the filter coefficient matrix H(ω) needs
to meet is that when the microphone acquired signals and the received signal
are each subjected to filtering with the filter coefficient matrix H(ω) and the
signals after filtering are added together, acoustic echo and noise signals are
cancelled and only the send speech signal is sent at a desired level.

Accordingly, for the signals during the receiving period and the
noise period, the following equations (29) and (30) are ideal conditions by
which the filtered and added signals become 0.

$$\bigl( Z_{E}(\omega)\ \ X_{E,1}(\omega)\ \cdots\ X_{E,M}(\omega) \bigr)\, H(\omega) = 0 \tag{29}$$

$$\bigl( Z_{N}(\omega)\ \ X_{N,1}(\omega)\ \cdots\ X_{N,M}(\omega) \bigr)\, H(\omega) = 0 \tag{30}$$

For the signal during the utterance period, the following equation is an ideal
condition by which the filtered and added signal becomes equivalent to a
signal obtained by multiplying the microphone acquired signals and the
received signal by the weighted mixing vector A(ω) composed of
predetermined M+1 elements.

$$\bigl( Z_{S}(\omega)\ \ X_{S,1}(\omega)\ \cdots\ X_{S,M}(\omega) \bigr)\, H(\omega) = \bigl( Z_{S}(\omega)\ \ X_{S,1}(\omega)\ \cdots\ X_{S,M}(\omega) \bigr)\, A(\omega) \tag{31}$$

The first element a0(ω) of the weighted mixing vector A(ω)=(a0(ω), a1(ω), ...,
aM(ω)) represents a weighting factor for the received signal; normally, it is
set at a0(ω)=0.

Next, solving the conditions of Eqs. (29) to (31) by the least squares
method for the filter coefficient matrix H(ω) gives the following equation:

$$H(\omega) = \bigl\{ R_{SS}(\omega) + C_{N} R_{NN}(\omega) + C_{E} R_{EE}(\omega) \bigr\}^{-1} R_{SS}(\omega) A(\omega) \tag{32}$$

CE is a weight constant for acoustic echo return loss enhancement; the larger
the value of the weight constant, the more the acoustic echo return loss
enhancement increases. But, an increase in the value CE accelerates
deterioration of the frequency characteristics of the acquired signal and lowers
the noise reduction characteristic. On this account, CE is usually set at an
appropriate value approximately in the range of 0.1 to 10.0. The meanings
of the other symbols are the same as in the second embodiment.
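
The effect of the weight constant CE in Eq. (32) can be sketched with NumPy
as follows. This is an illustration under assumptions: the function name and
the default weights are hypothetical, the matrices are assumed to be
(M+1) x (M+1) arrays for one frequency, and the linear system is solved
directly instead of forming the inverse matrix, which is equivalent when the
bracketed sum is invertible.

```python
import numpy as np


def echo_noise_filter(r_ss, r_nn, r_ee, a, c_n=1.0, c_e=1.0):
    """Evaluate Eq. (32) for one frequency bin.

    r_ss, r_nn, r_ee: covariance matrices of the utterance, noise and
                      receiving periods, each (M+1) x (M+1)
    a:                weighted mixing vector with M+1 elements
    """
    lhs = r_ss + c_n * r_nn + c_e * r_ee
    # Solve lhs @ h = r_ss @ a rather than forming the inverse explicitly.
    return np.linalg.solve(lhs, r_ss @ a)
```

When REE is the outer product of a single echo vector, the response of the
filter to that vector shrinks as CE grows, which corresponds to a larger
echo return loss enhancement at the cost of a stronger departure from the
response requested for the send signal.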

In this way, the filter coefficients can be determined in such a manner 
as to adjust volume and reduce noise. 

Next, the filter coefficients, obtained by Eq. (32), are set in the filters
12_1 to 12_M and 23, which filter the microphone acquired signals and the
received signal, respectively. The filtered signals are added together by the
adder 13, from which the added signal is output as the send signal. The
other parts are the same as in the second embodiment of the present invention,
and hence no description will be repeated.
As described above, the fourth embodiment of the present invention
permits implementation of acoustic echo cancellation in addition to the effect
of noise reduction.
FIFTH EMBODIMENT 

Fig. 13 illustrates a fifth embodiment. According to the fifth
embodiment, in the fourth embodiment of Fig. 12, sound source positions are
detected during the utterance period, a covariance matrix is calculated for
each sound source and stored, and during the noise period a covariance matrix
for noise is calculated and stored. Then, these stored covariance matrices are
used to calculate filter coefficients for canceling noise and acoustic echo.
The microphone acquired signals and the received signal are filtered using
these filter coefficients to thereby obtain a send signal from which noise and
acoustic echo have been cancelled.

The configuration of the fifth embodiment is common to the
configuration of the third embodiment except for the removal of the acquired
sound level estimating part 19 in Fig. 10.

The state decision part 14 detects the utterance period, the receiving
period and the noise period as in the third embodiment. When the result of
decision by the state decision part 14 is the utterance period, the sound source
position detecting part 15 estimates the position of each sound source 9k.
The sound source position estimating method is the same as that used in the
first embodiment of Fig. 1, and hence no description will be repeated.

Next, in the frequency domain converting part 16 the acquired signals
and the received signal are converted to frequency domain signals, which are
provided to the covariance matrix calculating part 17.

The covariance matrix calculating part 17 calculates covariance
matrices RS1S1(ω) to RSKSK(ω) of the acquired signals for the respective sound
sources 9k and the received signal, a covariance matrix REE(ω) in the
receiving period and a covariance matrix RNN(ω) in the noise period. The
covariance matrix storage part 18 stores the covariance matrices RS1S1(ω) to
RSKSK(ω), REE(ω) and RNN(ω) in the corresponding areas MA1 to MAK,
MAK+1 and MAK+2, respectively, based on the result of decision by the state
decision part 14 and the results of position detection by the sound source
position detecting part 15.

Upon the send speech sound being acquired, the filter coefficient
calculating part 21 calculates filter coefficients for canceling acoustic echo
and noise. As is the case with the third embodiment, solving the conditional
expression for the filter coefficient matrix H(ω) by the least squares method
gives the following equation:

$$H(\omega) = \Bigl\{ \sum_{k=1}^{K} C_{Sk} R_{SkSk}(\omega) + C_{N} R_{NN}(\omega) + C_{E} R_{EE}(\omega) \Bigr\}^{-1} \sum_{k=1}^{K} C_{Sk} R_{SkSk}(\omega) A_{k}(\omega) \tag{33}$$

In the above, CS1 to CSK are weight constants of sensitivity constraints for the
respective sound sources, CE is a weight constant for the echo return loss
enhancement, and CN is a weight constant for the noise reduction rate.

The filter coefficients thus obtained are set in the filters 12_1 to 12_M and
23, which filter the microphone acquired sound signals and the received
signal, respectively. The filtered signals are added together by the adder 13,
from which the added signal is output as the send signal. The other parts are
the same as in the second embodiment of the present invention, and hence no
description will be repeated. The fifth embodiment permits generation of a
send signal having cancelled therefrom acoustic echo and noise as is the case
with the third embodiment. Further, according to the fifth embodiment,
sensitivity constraints can be imposed on a plurality of sound sources, and
sensitivity can be held for a sound source having uttered speech sound
previously as well. Accordingly, this embodiment is advantageous in that
even when the sound source moves, the speech quality does not
deteriorate in the initial part of the speech sound since the sensitivity for the
sound source is maintained if it has uttered speech sound in the past.
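
The filter-and-add structure formed by the filters 12_1 to 12_M, the filter 23 and the adder 13 might be sketched as below. The sketch assumes that filtering is applied in the frequency domain as a per-bin multiplication, which is an assumption made for illustration (time-domain filters are equally possible); the function name and array shapes are hypothetical.

```python
import numpy as np

def send_signal(H, X, Z):
    """Filter each input and add the results, bin by bin.

    H: filter coefficients, shape (M + 1, n_bins); rows 0..M-1 correspond to the
       filters 12_1..12_M and row M to the filter 23
    X: microphone spectra, shape (M, n_bins)
    Z: received-signal spectrum, shape (n_bins,)
    """
    M = X.shape[0]
    # Filtered microphone signals plus the filtered received signal (adder 13).
    return np.sum(H[:M] * X, axis=0) + H[M] * Z
```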
SIXTH EMBODIMENT 

A sound acquisition apparatus according to a sixth embodiment of the
present invention will be described.

In the sound acquisition apparatus of this embodiment, the weighting
factors C_S1 to C_SK of the sensitivity constraints for the sound source positions
9_k in the sound acquisition apparatuses of the first to third and fifth
embodiments are varied with time.

The time-variant weighting factors C_S1 to C_SK of the sensitivity
constraints for the sound sources 9_1 to 9_K are set smaller the further in the
past the corresponding utterance was made. A first method is to reduce the
weighting factor C_Sk with an increase in the elapsed time from the detection
of each already detected sound source position to the detection of the most
recently detected sound source position. A second method is to set the
weighting factor C_Sk smaller in order of detection of the K sound source
positions, from newest to oldest.

Fig. 14 illustrates in block form the functional configuration of a
weighting factor setting part 21H for implementing the above-said first
method. The weighting factor setting part 21H is made up of: a clock 21H1
that outputs time; a time storage part 21H2 that, upon each detection of a sound
source position, overwrites the time t of detection, using, as an address, the
number k representing the detected sound source 9_k; and a weighting factor
determining part 21H3. Based on the time of detection of the sound source
position stored in the time storage part 21H2, the weighting factor
determining part 21H3 assigns a predetermined value C_S as the weighting
factor C_Sk to the currently detected sound source of number k(t), and assigns
a value q^(t-t_k)·C_S as the weighting factor C_Sk to each of the other sound sources
of numbers k≠k(t) in accordance with the elapsed time t-t_k after the detection
time t_k. Here, q is a predetermined value in the range 0<q<1. In this way, the
weighting factors C_S1 to C_SK of sensitivity constraints are determined for the
respective sound sources, and they are provided to 21A1 to 21AK.
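
The first method can be illustrated by the short sketch below. It is a sketch under stated assumptions, not the circuit of Fig. 14: the dictionary stands in for the time storage part 21H2, the function name is hypothetical, and the unit in which the elapsed time t-t_k is measured is not specified in the text and is left to the caller.

```python
def weights_by_elapsed_time(detect_times, t_now, c_s, q):
    """Return C_Sk = q**(t_now - t_k) * c_s for every stored source number k.

    detect_times: dict mapping source number k to its last detection time t_k
    t_now: time of the current detection (the current source has t_k == t_now)
    c_s: predetermined value C_S
    q: predetermined value with 0 < q < 1
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must satisfy 0 < q < 1")
    # For the currently detected source the exponent is 0, so its weight is c_s.
    return {k: c_s * q ** (t_now - t_k) for k, t_k in detect_times.items()}
```

For example, with detection times {1: 0, 2: 3, 3: 5}, a current time of 5, C_S = 1 and q = 0.5, source 3 keeps the full weight, while sources 2 and 1 are reduced by factors of 0.5^2 and 0.5^5, respectively.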

Fig. 15 illustrates in block form the functional configuration of a
weighting factor setting part 21H for implementing the above-said second
method; in this example, it is made up of a clock 21H1, a time storage part
21H2, an order decision part 21H4, and a weighting factor determining part
21H5. The order decision part 21H4 decides the order of detection of the
positions of the sound sources 9_1 to 9_K (newest first), {k(t)} = {k(1), ...,
k(K)}, from the times stored in the time storage part 21H2. The weighting
factor determining part 21H5 assigns a predetermined value C_S as a weighting
factor C_Sk(1) to the most recently detected sound source 9_k(1). For the other
sound sources, the weighting factor determining part calculates
C_Sk(t+1) ← q·C_Sk(t) for t = 1, 2, ..., K-1 to obtain weighting factors
C_Sk(2), ..., C_Sk(K). These weighting factors are rearranged following the order
{k(1), ..., k(K)}, thereafter being output as weighting factors C_S1, ..., C_SK.
The value of q is a preset value in the range 0<q<1.
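
The second method depends only on the order of detection, not on the elapsed time itself. A minimal sketch follows, again with a dictionary standing in for the time storage part 21H2 and with a hypothetical function name.

```python
def weights_by_detection_order(detect_times, c_s, q):
    """Assign c_s to the newest source and multiply by q for each older one.

    detect_times: dict mapping source number k to its last detection time
    Returns a dict mapping source number k to its weighting factor C_Sk.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must satisfy 0 < q < 1")
    # Order decision: source numbers sorted from newest to oldest detection.
    order = sorted(detect_times, key=detect_times.get, reverse=True)
    weights = {}
    w = c_s
    for k in order:
        weights[k] = w   # C_Sk(1) = c_s, C_Sk(2) = q * c_s, ...
        w = q * w        # recursion C_Sk(t+1) <- q * C_Sk(t)
    return weights
```

Keying the result by source number corresponds to the rearrangement into C_S1, ..., C_SK. Unlike the first method, a long silence does not shrink the weights further; only a newly detected source position does.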

By varying the weights of sensitivity constraints for the respective
sound sources as described above, it is possible to reduce the sensitivity
constraints for the sound source positions where utterance was made in the
past. Thus, as compared with the sound acquisition apparatuses of the first
to third embodiments, the apparatus of this embodiment reduces the number
of sound sources to be subjected to sensitivity constraints, enhancing the
acquired sound level adjustment capability and the noise and acoustic echo
cancellation functions.

The other parts are the same as those in the first to third and fifth 
embodiments of the present invention, and hence no description will be 
repeated. 

SEVENTH EMBODIMENT 

A sound acquisition apparatus according to a seventh embodiment of
the present invention will be described.

The sound acquisition apparatus according to the seventh embodiment
of the present invention features whitening the covariance matrix R_XX(ω) in
the filter coefficient calculating part 21 of the sound acquisition apparatus
according to the first to sixth embodiments of the present invention. Fig. 16
illustrates the functional configuration of a representative one of whitening
parts 21J1 to 21JK indicated by the broken lines in the filter coefficient
calculating part 21 shown in Fig. 4. The whitening part 21J comprises a
diagonal matrix calculating part 21JA, a weighting part 21JB, an inverse
operation part 21JC and a multiplication part 21JD. The diagonal matrix
calculating part 21JA generates a diagonal matrix diag(R_XX(ω)) of the
covariance matrix R_XX(ω) fed thereto. The weighting part 21JB assigns
weights to the diagonal matrix by calculating the following equation based on
a predetermined arbitrary matrix D of M or M+1 rows.

$$D^{T}\,\mathrm{diag}(R_{XX}(\omega))\,D \qquad (34)$$

The inverse calculation part 21JC calculates an inverse of Eq. (34):

$$\frac{1}{D^{T}\,\mathrm{diag}(R_{XX}(\omega))\,D} \qquad (35)$$

In the above, the superscript T indicates a transpose of the matrix. In the multiplication part
21JD the result of calculation by the inverse calculation part 21JC is
multiplied by each covariance matrix R_XX(ω) input thereto to obtain a
whitened covariance matrix.
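
A minimal sketch of the whitening step follows. It assumes that D is a single column (a weight vector with M or M+1 entries), so that Eq. (34) evaluates to a scalar and Eq. (35) is an ordinary reciprocal; the text only states that D has M or M+1 rows, so this reading, like the function name, is an assumption.

```python
import numpy as np

def whiten(R, d):
    """Scale a covariance matrix by the reciprocal of d^T diag(R) d.

    R: covariance matrix R_XX at one frequency, shape (L, L)
    d: weight vector standing in for D, shape (L,)
    """
    R = np.asarray(R)
    d = np.asarray(d, dtype=float)
    diag_R = np.diag(np.diag(R).real)   # diag(R_XX): keep only the diagonal
    scale = d @ diag_R @ d              # Eq. (34), a scalar for a vector d
    return R / scale                    # Eq. (35) multiplied by R_XX
```

Because the scale factor grows in proportion to the signal power at the frequency in question, multiplying R_XX(ω) by any positive constant leaves the whitened matrix unchanged, which is why the resulting filter coefficients become insensitive to spectral changes. The sketch does not guard against a zero scale factor.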

With the covariance matrix thus whitened, the filter coefficients
obtained in the filter coefficient calculating part 21 no longer change with
spectral changes of the send signal, acquired signal and the noise signal. As
a result, the acquired sound level adjustment capability and the acoustic echo
and noise cancellation capabilities do not change with the spectral
changes, which makes it possible to achieve steady acquired sound level
adjustment and acoustic echo and noise cancellation.

The other parts are the same as in the first to fourth embodiments of
the present invention, and hence no description will be repeated.

EIGHTH EMBODIMENT 

A sound acquisition apparatus according to an eighth embodiment of
the present invention will be described.

The sound acquisition apparatus of the eighth embodiment features
that the covariance matrix storage part 18 of the sound acquisition apparatus
according to the first to seventh embodiments of the present invention
averages an already stored covariance matrix and a covariance matrix newly
calculated by the covariance matrix calculating part 17 and stores the
averaged covariance matrix as the current one.

The covariance matrices are averaged, for example, by the method
described below. Letting the already stored covariance matrix be
represented by R_XX,old(ω) and the covariance matrix newly calculated by the
covariance matrix calculating part 17 by R_XX,new(ω), the following equation is
used to calculate an average covariance matrix R_XX(ω).

$$R_{XX}(\omega) = (1-p)\,R_{XX,\mathrm{new}}(\omega) + p\,R_{XX,\mathrm{old}}(\omega) \qquad (36)$$

where p is a constant that determines the time constant of the average and
takes a value 0<p<1.
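
Eq. (36) is a first-order recursive (exponentially weighted) average. A minimal sketch, with a hypothetical function name, is:

```python
import numpy as np

def update_stored_covariance(R_old, R_new, p):
    """Return (1 - p) * R_new + p * R_old as in Eq. (36).

    R_old: covariance matrix already held in the storage area
    R_new: covariance matrix newly calculated for the same area
    p: averaging constant with 0 < p < 1
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must satisfy 0 < p < 1")
    return (1.0 - p) * np.asarray(R_new) + p * np.asarray(R_old)
```

A value of p close to 1 gives a long time constant, so that the stored matrix changes slowly and disturbances in any single estimate are smoothed out; a value close to 0 makes the stored matrix follow the newest estimate almost immediately.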

Fig. 17 illustrates the functional configurations of the covariance
matrix storage part 18 and an averaging part 18A provided therein. The
averaging part 18A comprises a multiplier 18A1, an adder 18A2, and a
multiplier 18A3. The covariance matrix R_SkSk(ω) corresponding to the
sound source 9_k, calculated by the covariance matrix calculating part 17, is
provided as a new covariance matrix R_SkSk,new(ω) to the multiplier 18A1 and is
multiplied by (1-p), and the multiplied output is applied to the adder 18A2.
On the other hand, the covariance matrix R_SkSk(ω) corresponding to the sound
source 9_k is read out of the storage area 18B, provided as an old covariance
matrix R_SkSk,old(ω) to the multiplier 18A3 and multiplied by the constant p.
The multiplied output is added by the adder 18A2 to the output
(1-p)R_SkSk,new(ω) from the multiplier 18A1, and the thus obtained average
covariance matrix R_SkSk(ω) is overwritten in the storage area corresponding to
the sound source 9_k.

By averaging covariance matrices and storing the averaged covariance
matrix as described above, it is possible to lessen the influence of circuit
noise or a similar disturbance as compared with that before averaging, hence
providing a more accurate covariance matrix. This makes it possible to determine
filter coefficients that enhance the acquired sound level adjustment, noise
cancellation or acoustic echo cancellation performance.

The other parts are the same as in the first to fifth embodiments of the 
present invention, and hence no description will be repeated. 

Incidentally, the present invention can be implemented by dedicated
hardware; alternatively, it is possible that a program for implementing the
invention is recorded on a computer-readable recording medium and read into
a computer for execution. The computer-readable recording medium refers
to a storage device such as a floppy disk, a magneto-optical disk, CD-ROM,
DVD-ROM, a nonvolatile semiconductor memory, or an internal or external
hard disk. The computer-readable recording medium also includes a
medium that dynamically holds a program for a short period of time (a
transmission medium or transmission wave) as in the case of transmitting the
program via the Internet, and a medium that holds the program for a fixed
period of time, such as a volatile memory in the computer system serving as a
server in that case.

EFFECT OF THE INVENTION

Next, to demonstrate the effectiveness of the first embodiment of the
sound acquisition apparatus according to the present invention, Figs. 18A and
18B show the results of simulations with microphones disposed at the corners of
a square measuring 20 cm by 20 cm. The simulation conditions are as follows:
number of microphones: 4, signal-to-noise ratio: about 20 dB, room
reverberation time: 300 ms, and number of speakers: 2 (speaker A at a
position 50 cm away from the center of the square in a direction at right
angles to one side thereof, speaker B at a position 200 cm away from the
center of the square in a direction at 90° to the speaker A). Fig. 18A shows
microphone received signal waveforms obtained when the speakers A and B
spoke alternately under the above-mentioned conditions. Comparison
between the speech waveforms of the speakers A and B indicates that the
speech waveform of the speaker B is smaller in amplitude. Fig. 18B shows
waveforms processed by the present invention. The speech waveforms of
the speakers A and B are nearly equal in amplitude, from which the effect of
acquired sound level adjustment can be confirmed.

Fig. 19 shows simulation results obtained with the third embodiment
shown in Fig. 10. The simulation conditions are as follows: number M of
microphones: 4, signal-to-noise ratio of the send signal before processing: 20 dB,
send signal-to-acoustic echo ratio: -10 dB, and room reverberation time: 300
ms. Fig. 19 shows the send signal levels obtained when signal sending
and receiving are repeated alternately under the above-mentioned conditions.
Row A shows the send signal level before processing, and Row B the send
signal level after processing by the third embodiment. The above results
indicate that the third embodiment reduces the acoustic echo by about 40 dB and
the noise signal by about 15 dB, from which it can be confirmed that the
embodiment of the invention is effective.

As described above, according to the first embodiment of the present
invention, it is possible to obtain a send signal of a volume adjusted for each
sound source position by: detecting the sound source position from signals
picked up by a plurality of microphones; calculating filter coefficients based
on a covariance matrix in the utterance period for each sound source position;
filtering the microphone acquired signals by the filter coefficients; and adding
the filtered signals.

According to the second embodiment of the present invention, it is
possible to achieve noise cancellation as well as the acquired sound level
adjustment by determining the filter coefficients by using a covariance matrix
in the noise period in addition to the covariance matrix in the utterance period
in the first embodiment.

According to the third embodiment of the present invention, it is
possible to achieve acoustic echo cancellation by determining the filter coefficients
by using a covariance matrix in the receiving period in addition to the
covariance matrix in the utterance period in the first or second embodiment.

According to the fourth embodiment of the present invention, it is
possible to reproduce the received signal by a loudspeaker and cancel acoustic
echo by determining the filter coefficients by using the covariance matrix in
the utterance period and the covariance matrix in the receiving period.

According to the fifth embodiment of the present invention, it is
possible to further cancel noise by determining the filter coefficients by using
the covariance matrix in the noise period in addition to the covariance
matrices in the utterance and receiving periods in the fourth embodiment.

According to the sixth embodiment of the present invention, it is
possible to further enhance the acquired sound level adjustment, noise
cancellation or acoustic echo cancellation performance by assigning a smaller
weighting factor to the covariance matrix of older utterance at the time of
calculating the filter coefficients in the first, second, third and fifth
embodiments.

According to the seventh embodiment of the present invention, it is
possible to implement acquired sound level adjustment, noise cancellation and
acoustic echo cancellation that are less susceptible to signal spectral changes by
whitening the covariance matrix at the time of calculating the filter
coefficients in the first to sixth embodiments.

According to the eighth embodiment of the present invention, when
the covariance matrix is stored in the first to seventh embodiments, the
covariance matrix and that already stored in the corresponding area are
averaged and a weighted mean covariance matrix is stored, by which it is
possible to obtain a more accurate covariance matrix and determine filter
coefficients that provide increased performance in the acquired sound level
adjustment, noise reduction and acoustic echo cancellation.